1. INTRODUCTION
The advancement of information technology (IT), and particularly of the Web, has greatly
expanded the accessibility of data and has thus led to a rise in plagiarism. Plagiarism is the practice of taking
someone else's work or ideas and passing them off as one's own. Dishonest authors use several plagiarism
techniques, some of which are listed below [1-2]:
— Copy-paste (word for word): the content of the text is copied from one or more sources and
may be slightly modified.
— Paraphrasing: the grammar of the text is changed and words are replaced by their synonyms. The
sentences of the original work are reorganized and some parts of the text are deleted.
— False references: references are changed, and some are false or do not even exist.
— Plagiarism with translation: the contents are translated and used without reference to the original work.
— Plagiarism of ideas: this is the most difficult type of plagiarism to detect, because it does not consist of
simple manipulations of the text but is a more advanced form that can include all the other
techniques.

In general, plagiarism techniques can be classified into three strategies: lexical, syntactic and
semantic. The plagiarism of ideas most often incorporates reformulations as well as semantic and
lexical changes, which make it very hard to detect [3].

Lexical methods consider text as a sequence of characters or terms [4]. Pre-processing
includes tokenization, lowercasing, punctuation removal and stemming [5]. The more terms two
documents have in common, the more similar they are. Methods such as longest common subsequence,
n-grams and fingerprinting belong to this category. The comparison units adopted include words,
sentences, human-defined sliding windows or n-grams [6-12]. Syntactic methods use the syntactic
units of a text to compare the similarity between documents. The implicit assumption is that similar documents
have a similar syntactic structure. These methods use characteristics such as POS tags to compare
different documents [13]. Semantic methods use semantic similarity to compare
documents. In this approach, semantic features (synonyms, hyponyms, hypernyms,
semantic dependencies) [2-3] are extracted from the source documents and then used to trace
plagiarism cases in the corpus. Plagiarism detection is considered a part of Natural Language
Processing (NLP). Hence, many NLP-based solutions have been proposed for lexical or
syntactic plagiarism, and most are based on concept extraction using a lexical resource such as WordNet [14-16].
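
As a minimal sketch of the lexical strategy described above, the following Python example applies the pre-processing steps (lowercasing, punctuation removal and tokenization; stemming is omitted) and compares two texts by their shared word n-grams. The function names, the trigram size and the normalization by the smaller n-gram set are illustrative assumptions and are not taken from the cited works.

```python
import string


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace (no stemming)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def word_ngrams(tokens, n=3):
    """Return the set of word n-grams found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(text_a, text_b, n=3):
    """Share of the smaller n-gram set that also occurs in the other text."""
    grams_a = word_ngrams(tokenize(text_a), n)
    grams_b = word_ngrams(tokenize(text_b), n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / min(len(grams_a), len(grams_b))


# Hypothetical example sentences, invented for this illustration.
source = "Plagiarism is the practice of taking someone else's work."
copied = "Plagiarism is the practice of taking another person's work!"
reworded = "Passing off the work of others as your own is dishonest."

print(ngram_overlap(source, copied))    # high: near copy-paste
print(ngram_overlap(source, reworded))  # zero: same idea, different wording
```

The reworded sentence shares no trigram with the source although it expresses the same idea, which is the limitation of purely lexical methods that motivates the semantic approaches discussed in this paper.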

With the classical approaches, two documents that share the same words are considered similar, and
word order is ignored, which causes the true meaning of a document to be lost. In recent years, deep
learning techniques have been the subject of much research in different domains, from pattern
recognition to NLP problems. The high performance obtained is very encouraging and makes it possible to
consider the use of these techniques in the field of plagiarism detection [17-18]. Techniques based on
deep learning for plagiarism detection capture not only the contextual (semantic) level of the document but
also the syntactic and lexical levels in the vector representation. The remainder of this paper is organized as
follows. Section 2 presents background concepts. Section 3 reviews related work. Section 4
contains an in-depth analysis of our comparative study. Section 5 presents the conclusion
and future work.


2. RESEARCH METHOD 

In this section we present the different techniques used by plagiarism detection approaches,
both for representing texts and for calculating similarity:
a. Neural network based models
Word embeddings are a type of word representation that stores contextual information in a
low-dimensional vector. This approach gained great popularity with the introduction in 2013 of Word2Vec,
a group of models that learn word embeddings in a computationally efficient way. Doc2Vec can be
seen as an extension of Word2Vec whose goal is to create a vector representation of a document or paragraph.
Word2vec: a neural network model used to produce a distributed representation of words. Some
researchers argue that it is not a deep learning technique, because it is a simple two-layer neural network
architecture. These shallow, two-layer neural networks are trained to reconstruct the linguistic
contexts of words. Word2vec takes a large corpus of text as its input and produces a vector space, typically of
several hundred dimensions, in which each unique word in the corpus is assigned a vector [19].
Doc2vec: an unsupervised algorithm that generates vector representations of sentences, paragraphs
and documents [20]. Its model is based on Word2Vec, adding only another vector (the paragraph ID) to the
input. The architecture of the Doc2Vec model is shown in Figure 1.
Figure 1. Doc2vec architecture
Instead of using just nearby words to predict a word, the model also uses another feature vector, which
is unique to the document.
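
The difference between the two inputs can be sketched in a few lines of Python. The example below only builds the training examples (context, target word) and does not train any network; the window size, the function names and the identifier "DOC_0" are hypothetical choices made for illustration. In an actual model, every context word and the document ID would each be mapped to a trainable vector.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) pairs, as in a Word2vec-style model."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target


def doc2vec_examples(doc_id, tokens, window=2):
    """Same pairs, with the document-unique ID added to every context."""
    for context, target in cbow_examples(tokens, window):
        yield [doc_id] + context, target


doc = "the cat sat on the mat".split()
for context, target in doc2vec_examples("DOC_0", doc):
    print(context, "->", target)
```

Because the same document ID takes part in every prediction made inside the document, its vector is pushed to capture what the individual word contexts have in common.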
b. Deep learning based models
Deep learning is a set of learning methods that attempt to model data with complex architectures combining
different non-linear transformations. The elementary building blocks of deep learning are neural networks, which are
combined to form deep neural networks. Several types of neural network architectures exist:
Recursive neural networks (RNN): these have been successful, for instance, in learning sequence and tree
structures in natural language processing, mainly continuous representations of phrases and sentences based on
word embeddings [21].
Siamese LSTM for learning document similarity: LSTM is a kind of recurrent neural network, and it
works well when the input is an entire sequence of words or sentences, because recurrent networks can model and
remember the relationships between different words and sentences. Manhattan LSTM models have two
networks, LSTMleft and LSTMright, each of which processes one of the sentences in a given pair independently.
Siamese LSTM is a version of Manhattan LSTM in which LSTMleft and LSTMright have the same tied weights,
such that LSTMleft = LSTMright. Such a model is useful for tasks like duplicate query detection and query
ranking. Here, the duplicate detection task is performed to find whether two documents are similar or not. A similar
model can be trained for query ranking using hit data for a given query and its matching results as a proxy for
similarity [21].
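
The Siamese idea can be illustrated without a full LSTM. In the sketch below, a single `encode` function stands in for the shared network (this mirrors the tied weights, LSTMleft = LSTMright), and the two outputs are compared with a score based on the Manhattan (L1) distance. The averaging encoder, the toy embeddings and the choice of exp(-L1) as the score are assumptions made for illustration only.

```python
import math


def encode(tokens, embeddings):
    """Stand-in for the shared LSTM encoder: averages word vectors (hypothetical)."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def manhattan_similarity(h_left, h_right):
    """exp(-L1 distance): 1.0 for identical vectors, towards 0 as they diverge."""
    l1 = sum(abs(a - b) for a, b in zip(h_left, h_right))
    return math.exp(-l1)


# Toy 2-dimensional embeddings, invented for this illustration.
toy_embeddings = {
    "cheap": [0.9, 0.1], "inexpensive": [0.8, 0.2],
    "flights": [0.1, 0.9], "recipes": [0.7, -0.6],
}

# Siamese setting: the same encoder processes both inputs.
left = encode(["cheap", "flights"], toy_embeddings)
right = encode(["inexpensive", "flights"], toy_embeddings)
other = encode(["cheap", "recipes"], toy_embeddings)

print(manhattan_similarity(left, right))  # close pair, score near 1
print(manhattan_similarity(left, other))  # distant pair, lower score
```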
Convolutional neural network: a CNN is a class of deep, feed-forward artificial neural networks that uses a
variation of multilayer perceptrons designed to require minimal preprocessing. CNNs are inspired by the animal
visual cortex. They are generally used in computer vision; however, they have recently been applied to
various NLP tasks such as text classification [21].
Deep Structured Semantic Model (DSSM): DSSM stands for Deep Structured Semantic Model or, more
generally, Deep Semantic Similarity Model. It is a deep neural network (DNN) modelling technique for
representing text strings (sentences, queries, predicates, entity mentions, etc.) in a continuous semantic space
and modelling the semantic similarity between two text strings.
c. Other models
Other methods can be used to construct a vector representation of a given text:
GloVe: an unsupervised learning algorithm for obtaining vector representations of words. Training is
performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting
representations showcase interesting linear substructures of the word vector space [22].
InferSent: a sentence embedding method that provides semantic representations for English sentences. It
is trained on natural language inference data and generalizes well to many different tasks [22].
d. Similarity methods
Finding the similarity between elements is the core of sentence similarity. The literature offers many
metrics for calculating similarity. This section presents the different approaches used to calculate similarity
between elements:
Cosine similarity: a measure of similarity between two non-zero vectors of an inner product space that
measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the
interval (0, π] radians [23].
Jaccard index: also known as Intersection over Union and the Jaccard similarity coefficient (originally given
the French name coefficient de communauté by Paul Jaccard), it is a statistic used for gauging the similarity and
diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets and is defined
as the size of the intersection divided by the size of the union of the sample sets [23].
Euclidean distance: the straight-line distance between two points. When data is dense or continuous, this is often
the preferred proximity measure. The Euclidean distance between two points is the length of the path connecting them, and
it is obtained with the Pythagorean theorem [23].
Longest common subsequence (LCS) method: consists of finding the longest subsequence common to all
sequences in a set of sequences. The longest common subsequence problem is a classic computer science
problem, the basis of data comparison programs such as the diff utility, and has applications in computational
linguistics and bioinformatics [24].
Word Mover’s Distance (WMD): uses word embeddings to calculate similarity; more precisely, it uses
normalized bag-of-words and word embeddings to calculate the distance between documents [25].
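
The first four measures above are simple enough to write out directly. The following Python sketch uses only the standard library; the function names are our own, and the value returned by `jaccard_index` for two empty sets (1.0) is a convention chosen here, not part of the definition.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def jaccard_index(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def euclidean_distance(u, v):
    """Straight-line distance between two points (Pythagorean theorem)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lcs_length(seq_a, seq_b):
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(seq_b) + 1)
    for item_a in seq_a:
        current = [0]
        for j, item_b in enumerate(seq_b, start=1):
            if item_a == item_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))        # vectors 45 degrees apart
print(jaccard_index({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared out of 4
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))       # 3-4-5 triangle
print(lcs_length("the cat sat".split(), "the black cat sat down".split()))
```

Cosine similarity and Euclidean distance operate on vectors, the Jaccard index on sets of terms, and LCS on ordered sequences, which is why the approaches reviewed in the next section pair each measure with a particular text representation.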


3. RELATED WORK 

Our study focuses on the detection of semantic plagiarism, and more precisely on the identification of the
plagiarism of ideas between two given texts. The methods that detect this type of
plagiarism are reviewed below:
The authors of [26] proposed a plagiarism detection system that relies on sentence comparison in two
phases. They first extract word vectors with the word2vec algorithm and remove Persian stop words during
text pre-processing. Then, for each sentence, the average of all word vectors is calculated. After feature
extraction, in phase 1, each sentence in a suspicious document is compared with all the sentences in the
source documents, using cosine similarity as the comparison metric. This step helps to find the
nearest sentences in real time. In phase 2, the lexical similarity of two sentences is evaluated with the Jaccard
similarity measure. Two sentences that pass the Jaccard similarity threshold are considered plagiarism in the final
step. The authors of [27] proposed using the word2vec model to compute a feature vector for every word. They
chose documents from the corpus itself; the documents used for testing were pre-processed
by stop word removal. The similarity between vectors was computed using
cosine similarity. The aim of the approach in [24] is to evaluate the validity of using distributed
representations to define word similarity. The authors introduce three methods based on the following three
similarities between two documents: the length of the longest common subsequence (LCS) divided by
the length of the shorter document, the local maximal value of the length of the LCS, and the local maximal
value of the weighted length of the LCS. The distributed representation was obtained by word2vec from no particular
data.
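
The two-phase scheme described for [26] can be sketched as follows. This is a simplified illustration and not the authors' implementation: the toy embeddings, both thresholds (`cos_min`, `jac_min`) and all function names are assumptions, and stop words are taken to have been removed beforehand.

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the word vectors of a sentence (unknown words are skipped)."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def jaccard(tokens_a, tokens_b):
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def detect(suspicious, sources, embeddings, cos_min=0.9, jac_min=0.5):
    """Phase 1: nearest source sentence by cosine. Phase 2: Jaccard check."""
    flagged = []
    for sent in suspicious:
        vec = sentence_vector(sent, embeddings)
        best = max(sources, key=lambda s: cosine(vec, sentence_vector(s, embeddings)))
        if (cosine(vec, sentence_vector(best, embeddings)) >= cos_min
                and jaccard(sent, best) >= jac_min):
            flagged.append((sent, best))
    return flagged


# Toy 2-dimensional embeddings and sentences, invented for this illustration.
toy = {
    "students": [1.0, 0.0], "copy": [0.0, 1.0], "essays": [0.7, 0.7],
    "reports": [0.6, 0.8], "rain": [-1.0, 0.2], "falls": [-0.8, -0.5],
}
sources = [["students", "copy", "essays"], ["rain", "falls"]]
suspicious = [["students", "copy", "reports"], ["rain", "copy"]]

for sent, match in detect(suspicious, sources, toy):
    print(sent, "matches", match)
```

Only the first suspicious sentence is flagged: it is close to a source sentence in the embedding space and also shares half of its vocabulary with it, while the second one fails the cosine filter.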

Another approach uses the principle of the Deep Structured Semantic Model (DSSM) proposed by [28].
DSSM is a deep learning-based technique proposed for the semantic understanding of textual data. It maps
short textual strings, such as sentences, to feature vectors in a low-dimensional semantic space. The
vector representations are then used for document retrieval by comparing the similarity between documents and
queries. After the semantic feature vectors are obtained for each pair of text snippets, cosine similarity is
used to measure the semantic similarity between the pair. Similarly to the previous methods, in [29]
documents or texts are represented as vectors using the document-to-vector technique
(doc2vec), and plagiarism is detected by a simple comparison between all sentences of
each pair of documents analysed.

The approach proposed in [30] is based on converting a paragraph into a vector and is inspired by the
methods for learning word vectors. The inspiration is that the word vectors are asked to contribute to a
prediction task about the next word in the sentence. So, despite the fact that the word vectors are initialized
randomly, they can eventually capture semantics as an indirect result of the prediction task. The authors use this
idea for their paragraph vectors in a similar manner: the paragraph vectors are also asked to contribute to the
prediction task of the next word given many contexts sampled from the paragraph.

The approaches in [29-30] perform similarity detection between document vectors
and also use the cosine to compare the vectors. In [31], each word w is represented by a vector;
these word vectors are constructed using GloVe. This approach uses the recursive neural network algorithm to
obtain a vector representation of a sentence and uses the cosine to calculate the similarity. In [32], two input
sentences are processed in parallel by identical neural networks, which output sentence representations. The
sentence representations are compared by a structured similarity measurement layer. The similarity features
are then passed to a fully connected layer that computes the similarity score. Cosine distance measures the
distance between two vectors according to the angle between them. The use of cosine to detect similarity between
sentences remains a solution that carries many risks. InferSent [22] is an NLP technique for universal
sentence representation developed by Facebook that uses supervised training to produce highly transferable
representations. Its authors used a bi-directional LSTM with attention that consistently surpassed many
unsupervised training methods such as SkipThought vectors. They also provide a PyTorch implementation
that they used to generate sentence embeddings. This approach still needs a similarity measure to
compare two vectors, and for that purpose cosine similarity is used.

The authors of [33] used word embeddings, vector representations of terms computed from
unlabelled data, which represent terms in a semantic space in which the proximity of vectors can be interpreted as
semantic similarity. They propose to go from word-level to text-level semantics by combining insights from
methods based on external sources of semantic knowledge with word embeddings. They derive multiple types
of meta-features from the comparison of the word vectors for short text pairs, and from the vector means of
their respective word embeddings. The features representing labelled short text pairs are used to train a
supervised learning algorithm. The authors of [25] present the Word Mover’s Distance (WMD), a novel distance function
between text documents. This work is based on recent results in word embeddings that learn semantically
meaningful representations for words from local co-occurrences in sentences. The WMD measures
the dissimilarity between two text documents as the minimum amount of distance that the embedded words
of one document need to “travel” to reach the embedded words of another document. The authors of [34]
proposed an innovative word embedding-based system devoted to calculating the semantic similarity of
Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in
order to capture the semantic and syntactic properties of words. IDF weighting and part-of-speech tagging
are applied to the examined sentences to support the identification of the words that are highly descriptive in
each sentence.
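
Computing the exact Word Mover’s Distance requires solving an optimal transport problem, which is beyond a short example. The sketch below instead implements a relaxed variant in which every word sends all of its weight to its nearest word in the other document; this gives a lower bound on the WMD rather than the WMD itself. The toy embeddings and function names are assumptions, and every token is assumed to have an embedding.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words: each word's share of the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(doc_a, doc_b, embeddings):
    """Each word of doc_a sends all its weight to its nearest word in doc_b."""
    targets = set(doc_b)
    return sum(
        weight * min(euclidean(embeddings[word], embeddings[t]) for t in targets)
        for word, weight in nbow(doc_a).items()
    )


def relaxed_wmd(doc_a, doc_b, embeddings):
    """Lower bound on the WMD: the larger of the two one-sided relaxations."""
    return max(one_sided_cost(doc_a, doc_b, embeddings),
               one_sided_cost(doc_b, doc_a, embeddings))


# Toy 2-dimensional embeddings, invented for this illustration.
toy = {
    "president": [1.0, 0.0], "leader": [0.9, 0.1],
    "speaks": [0.0, 1.0], "talks": [0.1, 0.9],
    "banana": [-1.0, -1.0],
}
doc_a = ["president", "speaks"]
doc_b = ["leader", "talks"]
doc_c = ["banana", "banana"]

print(relaxed_wmd(doc_a, doc_b, toy))  # small: different words, close meaning
print(relaxed_wmd(doc_a, doc_c, toy))  # large: unrelated content
```

Note that `doc_a` and `doc_b` share no word at all, yet their distance is small because their words lie close together in the embedding space.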

The authors of [35] address the issue of finding an effective vector representation for a very short text
fragment. By effective they mean that the representation should grasp most of the semantic information in
that fragment. For this, they use semantic word embeddings to represent individual words, and they learn how
to weigh every word in the text through the use of tf-idf (term frequency-inverse document frequency)
information to arrive at an overall representation of the fragment. Two tf-idf vectors are compared
with a standard cosine similarity. The paper [36] investigates the effectiveness of several such naive
techniques, as well as traditional tf-idf similarity, for fragments of different lengths. Its main contribution is
a first step towards a hybrid method that combines the strength of dense distributed representations (as
opposed to sparse term matching) with the strength of tf-idf based methods to automatically reduce the impact
of less informative terms. This approach outperforms the existing techniques in a toy experimental set-up,
leading to the conclusion that the combination of word embeddings and tf-idf information might lead to a
better model for semantic content within very short text fragments. The cosine similarity is then calculated
between two such representations.
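
One simple way to combine the two sources of information is to use idf values as fixed weights when averaging word vectors. The cited works learn the weighting, so the sketch below is only a simplified stand-in; the smoothed idf formula, the toy corpus and embeddings, and the function names are assumptions.

```python
import math


def idf_weights(corpus):
    """Smoothed inverse document frequency for every term in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def weighted_fragment_vector(tokens, embeddings, idf):
    """idf-weighted mean of the word vectors of a short text fragment."""
    dim = len(next(iter(embeddings.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for token in tokens:
        if token in embeddings and token in idf:
            w = idf[token]
            total = [acc + w * x for acc, x in zip(total, embeddings[token])]
            weight_sum += w
    return [x / weight_sum for x in total] if weight_sum else total


# Toy corpus and 2-dimensional embeddings, invented for this illustration.
corpus = [["the", "plagiarism", "detection"], ["the", "weather"], ["the", "detection"]]
toy = {
    "the": [0.0, 1.0], "plagiarism": [1.0, 0.0],
    "detection": [0.8, 0.2], "weather": [-0.5, 0.5],
}

idf = idf_weights(corpus)
print(idf["the"], idf["plagiarism"])
print(weighted_fragment_vector(["the", "plagiarism"], toy, idf))
```

The frequent word "the" receives the lowest weight, so the fragment vector is pulled towards the vector of the more informative term instead of sitting halfway between the two.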

In the architecture proposed in [37], word embeddings are first trained on API documents, tutorials
and reference documents, and then aggregated in order to estimate semantic similarities between documents,
where the similarity between vectors is usually defined as cosine similarity. The authors of [38] propose to
combine explicit semantic analysis (ESA) representations and word2vec representations as a way to generate
denser representations and, consequently, a better similarity measure between short texts. The authors of [39]
proposed a semantic similarity approach for paraphrase identification in Arabic texts that combines different
Natural Language Processing (NLP) techniques, such as Term Frequency-Inverse Document Frequency
(TF-IDF). The goal is to represent each word as a vector using word2vec, to generate a sentence
vector representation, and then to apply a similarity measurement based on different
comparison metrics, such as cosine similarity and Euclidean distance. This approach was evaluated on the Open
Source Arabic Corpus (OSAC) and obtained promising results.

The paper [40] proposes a novel deep neural network-based approach that relies on coarse-grained
sentence modelling using a convolutional neural network and a long short-term memory model, combined
with a specific fine-grained word-level similarity matching model. In the sentence modelling component, every
sentence is represented using a joint CNN and LSTM architecture. The CNN is able to learn the local features from
words to phrases in the text, while the LSTM learns the long-term dependencies of the text. More
specifically, the authors first take the word embeddings as input to their CNN model, in which various types of
convolution and pooling techniques are applied to capture the maximum information from the text. Next, the
encoded features are used as input to the LSTM network. Finally, the long-term dependencies learned by the
LSTM become the semantic sentence representation.

The approach in [41] proposes to explicitly model pairwise word interactions and presents a novel
similarity focus mechanism to identify important correspondences for better similarity measurement. The authors
used GloVe word embeddings for the vector representation of words, and their model contains four major
components:
1. Bidirectional Long Short-Term Memory Networks (Bi-LSTMs) are used for context
modelling of input sentences.
2. A novel pairwise word interaction modelling technique encourages direct
comparisons between word contexts across sentences. Cosine distance (cos) measures the distance between two
vectors by the angle between them, while L2 Euclidean distance (L2Euclid) and dot-product distance
(DotProduct) measure magnitude differences. The three similarity functions are used together for richer measurement.
3. A novel similarity focus layer helps the model identify important pairwise word interactions across sentences.
4. A deep, multi-layer convolutional neural network (ConvNet) converts the similarity measurement problem into a
pattern recognition problem for final classification.
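
The pairwise word interaction step (component 2) can be sketched in isolation. In the example below, short lists of numbers stand in for the per-word context vectors that the Bi-LSTMs would produce; these vectors and the function names are assumptions, and the similarity focus layer and the ConvNet are not modelled.

```python
import math


def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def l2_euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def pairwise_interactions(contexts_a, contexts_b):
    """For every word pair (i, j), return the tuple (cos, L2Euclid, DotProduct)."""
    return [
        [(cos(u, v), l2_euclid(u, v), dot_product(u, v)) for v in contexts_b]
        for u in contexts_a
    ]


# Hypothetical context vectors: 2 words in sentence A, 3 words in sentence B.
contexts_a = [[1.0, 0.0], [0.0, 2.0]]
contexts_b = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

grid = pairwise_interactions(contexts_a, contexts_b)
for row in grid:
    print(row)
```

The result is a grid with one cell per word pair and three values per cell, which can be treated like a multi-channel image; this is what allows a convolutional network to handle similarity measurement as a pattern recognition problem.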

The model of [42] is applied to assess semantic similarity between sentences. For this application,
the authors provide word2vec word-embedding vectors to the LSTMs, which use a fixed-size vector to encode the
underlying meaning expressed in a sentence (irrespective of the particular wording or syntax). By restricting
subsequent operations to rely on a simple Manhattan metric, they compel the sentence representations
learned by their model to form a highly structured space whose geometry reflects complex semantic
relationships. The paper [43] proposes a model for comparing sentences that uses a multiplicity of
perspectives. The authors first model each sentence using a convolutional neural network that extracts features at
multiple levels of granularity and uses multiple types of pooling. They then compare the sentence
representations at several granularities using multiple similarity metrics (cos, L2Euclid). They apply the model
to three tasks, including the Microsoft Research paraphrase identification task and two SemEval semantic
textual similarity tasks.

The authors of [44] present a convolutional neural network architecture for reranking pairs of short
texts, in which they learn the optimal representation of text pairs and a similarity function to relate them in a
supervised way from the available training data. Their network takes only words as input, thus requiring
minimal preprocessing. In particular, they consider the task of reranking short text pairs in which the elements of
the pair are sentences. They test their deep learning system on two popular retrieval tasks from TREC:
Question Answering and Microblog Retrieval. The system in [45] combines convolutional and recurrent neural
networks to measure the semantic similarity of sentences. It uses a convolutional network to take account of
the local context of words and an LSTM to consider the global context of sentences. This combination of
networks helps to preserve the relevant information of sentences and improves the calculation of the
similarity between sentences. This review of the state of the art allowed us to identify the strengths and
weaknesses of each approach, which helped us to build our own approach. Table 1 summarizes and
compares the methods above:


Table 1. Comparative table

The vector representation is given at word level (Word repr.) and at sentence level (Sentence repr.).
The letters in the Critique column refer to the critical remarks listed after the table.

Approach  Word repr.       Sentence repr.             Level     Similarity method                Dataset/resources                                        Critique
[26]      Word2vec         Average                    sentence  Cosine, Jaccard                  PAN 2016                                                 (a)
[27]      Word2vec         -                          word      Cosine                           OSAC Arabic corpus                                       (b)
[32]      Word2vec         -                          word      Cosine                           Microsoft Research Paraphrase Corpus                     (b)
[33]      Word2vec         -                          word      Cosine                           Microsoft Research Paraphrase Corpus                     (b)
[34]      Word2vec         -                          word      Cosine                           Microsoft Research Video Description Corpus              (b)
[35]      Word2vec         -                          word      Cosine                           Wikipedia dataset                                        (b)
[36]      Word2vec tf-idf  -                          word      Cosine                           Wikipedia dataset                                        (b)
[37]      Word2vec         -                          word      Cosine                           Wiki corpus                                              (b)
[38]      Word2vec         -                          word      Cosine                           -                                                        (b)
[39]      Word2vec         -                          word      Cosine, Euclidean distance       OSAC Arabic corpus                                       (b)
[24]      Word2vec         -                          word      LCS                              PAN 2013                                                 (c)
[28]      DSSM             -                          word      Cosine                           SemEval 2015 English STS                                 (d)
[29]      -                Doc2vec                    sentence  Cosine                           -                                                        (e)
[30]      -                Doc2vec                    sentence  Cosine                           Stanford Sentiment Treebank; IMDB                        (e)
[31]      GloVe            Recursive neural networks  sentence  Cosine                           SemEval-2015 Task 2                                      (f)
[22]      Word2vec         InferSent                  sentence  Cosine                           -                                                        (g)
[25]      Word2vec         -                          word      WMD                              BBCSport                                                 (h)
[40]      GloVe            -                          word      CNN-RNN                          SemEval 2015                                             (i)
[45]      Word2vec         -                          word      CNN-LSTM                         SICK                                                     (i)
[41]      GloVe            -                          word      LSTM, CNN, DotProduct, L2Euclid  SemEval 2014; Microsoft Video Paraphrase Corpus; WikiQA  (i)
[42]      Word2vec         -                          word      LSTM                             SICK                                                     (i)
[43]      Word2vec         -                          word      CNN                              SemEval; Microsoft Research Paraphrase Corpus            (j)
[44]      Word2vec         -                          word      CNN                              TREC Question Answering and Microblog Retrieval          (j)

Critical remarks:
(a) Loss of the meaning of the sentence.
(b) Two documents that share the same vectors may still not be plagiarized. Using cosine similarity to detect
similarity between sentences remains a solution that carries many risks.
(c) The LCS problem seeks the longest subsequence common to every member of a given set of vectors, which loses
the semantic aspect.
(d) The treatment is at the level of sentences or short texts.
(e) The system is slow. The semantic aspect of a paragraph is lost because the comparison is done sentence
by sentence.
(f) Using doc2vec is better than using an RNN. The semantic aspect of a paragraph is lost.
(g) Using cosine to detect similarity between sentences remains a solution that carries many risks. The
comparison is done at the sentence level, so the semantic aspect of the paragraph or text analysed is always
lost. According to the studies in [31] and [32], the use of doc2vec gives trampling results.
(h) This method is used only to detect similarity between short sentences.
(i) These methods are based on the vector representation of words, so they are used only to detect similarity
between sentences, not between texts.
(j) The level of representation of the analysed data remains a problem: with a word-level representation, only
short sentences can be analysed. Using a CNN to compute the similarity between lists of words poses several
problems, such as the loss of the semantic level of the sentence construction.


In addition, we were able to identify the most powerful methods used for the representation of a text.
We found that the doc2vec principle, as studied in [29-30], remains the most relevant solution,
and we took inspiration from it to build our learning system that detects
plagiarism between documents.


4. RESULTS AND DISCUSSION
In this part we analyse the results of the study carried out above. We first define
the most important comparison criteria:


Vector representation: the treatment performed on a text to transform it into a list of vectors that
keep the semantic and syntactic aspects, as offered by deep learning algorithms.


Level treatment: this criterion defines the level at which a text is treated, more precisely whether the text is treated
word by word or sentence by sentence.


Similarity method: this criterion covers the approaches used for calculating the similarity between the
vectors that represent the texts, which gives us an overall view of the strengths and weaknesses
of each method.

In addition, we discuss the critical points of each approach presented in the
section above. Starting with the methods used for the vector representation of a text, the
analysis shows that most of the approaches use either word2vec or doc2vec for their vector
transformation, so we conclude that the Mikolov representations are the best methods for keeping the
semantic aspect of a given text. On the other hand, each approach treats the text in its own way: some
transform it into a list of words and others into a list of sentences. These representations yield results
that differ from one approach to another, but in our opinion the transformation of a text into a list of sentences
remains the most relevant, since the meaning of the treated text is taken into consideration. With regard to the
methods used for the similarity calculation, the preceding paragraphs mention the different ways used to
detect whether or not there is a similarity between the analysed texts. There are also many approaches that
use CNNs and RNNs in their plagiarism detection architecture, but most of them use the word level for their
vector representation, so they are used only to detect similarity between sentences, not between texts.

In conclusion, we found that almost all of these approaches use the cosine to calculate the similarity
between documents, and that they perform their similarity analysis word by word or
sentence by sentence. This raises a reliability problem for the results, since two
documents may share the same words or the same sentences without being semantically similar. In addition,
the semantic aspect can be lost when the documents are treated as a list of sentences or words. A method
is therefore needed that addresses this problem by representing a text
as a list of sentences that is eventually transformed into a list of vectors, together with
a treatment that keeps the semantic aspect of this list of sentences. Such a treatment would
process the list of sentences to detect similarity using an algorithm such as an RNN, which keeps the semantic
aspect of a text.
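
The reliability problem noted above can be reproduced with a few lines of Python. The sketch below compares texts by the cosine of their word-count vectors, a deliberately simple stand-in for the word-level approaches reviewed in this paper; the example sentences and the function name are invented for illustration.

```python
import math
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between word-count vectors; word order is ignored."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Same words, opposite meaning: the score is (numerically) maximal.
print(bow_cosine("the student copied the teacher",
                 "the teacher copied the student"))

# Same meaning, different words: the score is low.
print(bow_cosine("the student copied the teacher",
                 "the pupil plagiarised his instructor"))
```

The first pair is rated as practically identical although the roles of the two people are reversed, while the second pair, which expresses the same idea, is rated as dissimilar. Both failures stem from comparing isolated words without regard to the sentence they form.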


5. CONCLUSION 

In this paper, we have reviewed many different methods used for detecting the plagiarism of ideas
that are based on the principles of deep learning, and through this study we have built a critical assessment
of the weaknesses of previous work. This helped us to get a general idea of
the different deep learning methods used for plagiarism detection, and especially for semantic plagiarism
detection. In addition, this study has shown us the paths to follow for the construction of our approach,
benefiting from the strengths of each method and avoiding its weak points. Future
work consists of constructing and implementing our approach and comparing it with the other
methods presented in the related work section.